
Journal of Computational Biology

SAGE Publications

Preprints posted in the last 30 days, ranked by how well they match Journal of Computational Biology's content profile, based on 37 papers previously published here. The average preprint has a 0.02% match score for this journal, so anything above that is already an above-average fit.

1
k-Nearest Common Leaves algorithm for phylogenetic tree completion

Koshkarov, A.; Tahiri, N.

2026-04-04 evolutionary biology 10.64898/2026.04.02.716144 medRxiv
Top 0.1%
6.5%

Phylogenetic trees represent the evolutionary histories of taxa and support tasks such as clustering and Tree of Life reconstruction. Many established comparison methods, including the Robinson-Foulds (RF) distance, assume identical taxon sets. A methodological gap remains for trees with distinct but overlapping taxa. Existing approaches either prune non-common leaves, which can discard information, or complete both trees such that they share the same taxa. Completion is more comprehensive, but current methods typically ignore branch lengths, which are essential for identifying evolutionary patterns. This paper introduces k-Nearest Common Leaves (k-NCL), an algorithm for completing rooted phylogenetic trees defined on different but overlapping taxa. The method uses branch lengths and topological characteristics and does not rely on a specific distance measure. The k-NCL algorithm is designed to preserve evolutionary relationships in the trees under comparison. The running time is O(n²), where n is the size of the union of the two leaf sets. Additional properties include preservation of original distances and topology, symmetry, and uniqueness of the completion. Implemented in Python, k-NCL is evaluated on biological datasets of amphibians, birds, mammals, and sharks. Experimental results show that RF combined with k-NCL improves phylogenetic tree clustering performance compared to the RF(+) tree completion approach. Availability and implementation: An open-source implementation of k-NCL in Python and the datasets used in this study are available at https://github.com/tahiri-lab/KNCL.
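For readers unfamiliar with the baseline: the RF distance between two trees on the same taxa is simply the size of the symmetric difference of their clade sets, which is why trees on different taxa must first be pruned or completed. A minimal sketch, using a hypothetical clade-set encoding of rooted trees (this is not the authors' k-NCL implementation):

```python
# Illustrative Robinson-Foulds (RF) distance: each rooted tree is encoded as a
# set of clades (frozensets of taxon labels); RF is their symmetric difference.
def rf_distance(clades_a, clades_b):
    """RF distance between two trees given as sets of clades."""
    return len(clades_a ^ clades_b)

# Two rooted trees on the shared taxon set {A, B, C, D}.
t1 = {frozenset("AB"), frozenset("ABC")}
t2 = {frozenset("AC"), frozenset("ABC")}
print(rf_distance(t1, t2))  # -> 2: t1's {A,B} and t2's {A,C} are unmatched
```

Note the encoding assumes identical leaf sets; k-NCL's contribution is producing such comparable trees from overlapping ones while respecting branch lengths.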

2
Analysis of biological networks using Krylov subspace trajectories

Frost, H. R.

2026-03-31 bioinformatics 10.64898/2026.03.29.715092 medRxiv
Top 0.1%
3.7%

We describe an approach for analyzing biological networks using rows of the Krylov subspace matrix of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
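The construction described above can be sketched in a few lines: stack the power iterates v, Av, A²v, … as columns, and read each row as one node's trajectory. A minimal pure-Python sketch (a toy graph and normalization choice of my own, not the author's code):

```python
def matvec(A, v):
    """Dense matrix-vector product for a small adjacency matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def krylov_trajectories(A, v, m):
    """Columns v, Av, ..., A^(m-1)v; row i is node i's Krylov trajectory."""
    cols = [list(v)]
    for _ in range(m - 1):
        nxt = matvec(A, cols[-1])
        norm = sum(x * x for x in nxt) ** 0.5 or 1.0   # normalize each iterate
        cols.append([x / norm for x in nxt])
    return [list(row) for row in zip(*cols)]           # transpose: rows = nodes

# Toy 4-node path graph; the initial vector marks a perturbation at node 0.
A = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
v = [1.0, 0.0, 0.0, 0.0]
K = krylov_trajectories(A, v, 3)
print(len(K), len(K[0]))  # 4 node trajectories, each of length 3
```

Nodes with similar trajectories respond similarly to the initial perturbation, which is the intuition behind using the rows for community detection.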

3
Estimating Bayesian phylogenetic information content using geodesic distances

Milkey, A.; Lewis, P. O.

2026-04-01 evolutionary biology 10.64898/2026.03.31.715656 medRxiv
Top 0.1%
3.2%

A new Bayesian measure of phylogenetic information content is introduced based on geodesic distances in treespace. The measure is based on the relative variance of phylogenetic trees sampled from the posterior distribution compared to the prior distribution. This ratio is expected to equal 1 if there is no information in the data about phylogeny and 0 if there is complete information. Trees can be scaled to have the same mean tree length to avoid dominance by edge length information and focus on topological information. The method scales well, requiring only that a valid sample can be obtained from both prior and posterior distributions. We show how dissonance (information conflict) among data sets can also be estimated. Both simulated and empirical examples are provided to illustrate that the new approach produces sensible and intuitive results.

4
On the Comparison of LGT networks and Tree-based Networks

Marchand, B.; Tahiri, N.; Tremblay-Savard, O.; Lafond, M.

2026-04-01 bioinformatics 10.1101/2025.11.20.689557 medRxiv
Top 0.1%
2.7%

Phylogenetic networks are widespread representations of evolutionary histories for taxa that undergo hybridization or Lateral Gene Transfer (LGT) events. There are now many tools to reconstruct such networks, but no clearly established metric to compare them. Such metrics are needed, for example, to evaluate predictions against a simulated ground truth. Despite years of effort in developing metrics, known dissimilarity measures either do not distinguish all pairs of different networks, or are extremely difficult to compute. Since it appears challenging, if not impossible, to create the ideal metric for all classes of networks, it may be relevant to design them for specialized applications. In this article, we introduce a metric on LGT networks, which consist of trees with additional arcs that represent lateral gene transfer events. Our metric is based on edit operations, namely the addition/removal of transfer arcs and the contraction/expansion of arcs of the base tree, allowing it to connect the space of all LGT networks. We show that it is linear-time computable if the order of transfers along a branch is unconstrained, but NP-hard otherwise, in which case we provide a fixed-parameter tractable (FPT) algorithm parameterized by the level of the network. We implemented our algorithms and demonstrate their applicability in three numerical experiments. Full online version: https://www.biorxiv.org/content/10.1101/2025.11.20.689557

5
Interpolating and Extrapolating Node Counts in Colored Compacted de Bruijn Graphs for Pangenome Diversity

Parmigiani, L.; Peterlongo, P.

2026-03-18 bioinformatics 10.64898/2026.03.16.711983 medRxiv
Top 0.1%
2.1%

A pangenome is a collection of taxonomically related genomes, often from the same species, serving as a representation of their genomic diversity. The study of pangenomes, or pangenomics, aims to quantify and compare this diversity, which has significant relevance in fields such as medicine and biology. Originally conceptualized as sets of genes, pangenomes are now commonly represented as pangenome graphs. These graphs consist of nodes representing genomic sequences and edges connecting consecutive sequences within a genome. Among possible pangenome graphs, a common option is the compacted de Bruijn graph. In our work, we focus on the colored compacted de Bruijn graph, where each node is associated with a set of colors that indicate the genomes traversing it. In response to the evolution of pangenome representation, we introduce a novel method for comparing pangenomes by their node counts, addressing two main challenges: the variability in node counts arising from graphs constructed with different numbers of genomes, and the large influence of rare genomic sequences. We propose an approach for interpolating and extrapolating node counts in colored compacted de Bruijn graphs, adjusting for the number of genomes. To tackle the influence of rare genomic sequences, we apply Hill numbers, a well-established diversity index previously utilized in ecology and metagenomics for similar purposes, to proportionally weight both rare and common nodes according to the frequency of genomes traversing them.
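The Hill numbers mentioned above form a one-parameter family of diversity indices: order q = 0 counts classes (richness), q → 1 recovers the exponential of Shannon entropy, and larger q progressively down-weights rare classes. A small illustrative sketch (node frequencies are made up; this is not the authors' implementation):

```python
import math

def hill_number(counts, q):
    """Hill diversity of order q from raw counts; larger q down-weights rare classes."""
    total = sum(counts)
    p = [c / total for c in counts if c > 0]
    if abs(q - 1.0) < 1e-9:                 # q -> 1 limit: exp of Shannon entropy
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))

node_counts = [50, 30, 15, 4, 1]            # hypothetical node frequencies, rare nodes included
print(hill_number(node_counts, 0))          # -> 5.0: richness, every node counts equally
print(hill_number(node_counts, 2) < 5.0)    # -> True: rare nodes contribute less
```

Applied to colored compacted de Bruijn graph nodes, weighting by the frequency of genomes traversing each node lets the same formula interpolate between "count all nodes" and "count only common nodes".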

6
Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

Fletcher, W. L.; Sinha, S.

2026-04-01 bioinformatics 10.64898/2026.03.29.715113 medRxiv
Top 0.1%
1.9%

The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often feature characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performance in these tasks on diverse right-censored time-to-event data (also known as survival-time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily to compare their variable-selection capability, and secondarily their survival-time prediction, on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well on all metrics, and the LASSO and elastic net excelled on concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performance in controlling the false discovery rate. Some methods' performances were greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best-performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers choose the best approach for their needs when working with genomic data.

7
A New Information Theoretic Approach Shows that Mixture Models Outperform Partitioned Models for Phylogenetic Analyses of Amino Acid Data

Ren, H.; Jiang, C.; Wong, T. K. F.; Shao, Y.; Susko, E.; Minh, B. Q.; Lanfear, R.

2026-03-18 evolutionary biology 10.64898/2026.03.16.712229 medRxiv
Top 0.2%
1.7%

Partitioned and mixture models are widely employed in Maximum Likelihood phylogenetic analyses of large genomic datasets. Comparing the fit of the two types of models has been challenging, because standard information-theoretic approaches cannot be applied. Mixture models are increasingly popular for the analysis of amino acid datasets and can lead to different conclusions compared to partitioned models. This raises an important question: which type of model tends to perform better? Susko et al. (2026) recently introduced the marginal Akaike information criterion (mAIC), which allows mixture models and partitioned models to be directly compared for the first time. Here, we use the mAIC and a range of other approaches to compare the fit of mixture and partitioned models across a diverse set of empirical datasets. We show that mixture models are universally favoured on amino acid datasets. This has important implications for interpreting empirical analyses and suggests that continued development of mixture models is an important avenue for future research.

8
Ancestral state reconstruction with discrete characters using deep learning

Nagel, A. A.; Landis, M. J.

2026-03-21 evolutionary biology 10.64898/2026.03.19.712918 medRxiv
Top 0.2%
1.7%

Ancestral state reconstruction is a classical problem of broad relevance in phylogenetics. Likelihood-based methods for reconstructing ancestral states under discrete character models, such as Markov models, have proven extremely useful, but only work so long as the assumed model yields a tractable likelihood function. Unfortunately, extending a simple but tractable phylogenetic model to possess new, but biologically realistic, properties often results in an intractable likelihood, preventing its use in standard modeling tasks, including ancestral state reconstruction. The rapid advancement of deep learning offers a potential alternative to likelihood-based inference of ancestral states, particularly for models with intractable likelihoods. In this study, we modify the phylogenetic deep learning software phyddle to conduct ancestral state reconstruction. We evaluate phyddle's performance under various methodological and modeling conditions, while comparing to Bayesian inference when possible. For simple models and small trees, its performance resembles that of Bayesian inference, but worsens as tree size increases. While phyddle still performs adequately for more complex models, such as speciation and extinction models, the estimates differ more from Bayesian inference than they do under simpler models. Lastly, we use phyddle to infer ancestral states for two empirical datasets: the ancestral ranges of a subclade of the genus Liolaemus, and the ancestral locations of sequences from the 2014 Sierra Leone Ebola virus disease outbreak.

9
An abstract model of nonrandom, non-Lamarckian mutation in evolution using a multivariate estimation-of-distribution algorithm

Vasylenko, L.; Livnat, A.

2026-04-01 evolutionary biology 10.64898/2026.03.30.715341 medRxiv
Top 0.2%
1.7%

At the fundamental conceptual level, two alternatives have traditionally been considered for how mutations arise and how evolution happens: 1) random mutation and natural selection, and 2) Lamarckism. Recently, the theory of Interaction-based Evolution (IBE) has been proposed, according to which mutations are neither random nor Lamarckian, but are influenced by information accumulating internally in the genome over generations. Based on the estimation-of-distribution algorithms framework, we present a simulation model that demonstrates nonrandom, non-Lamarckian mutation concretely while capturing indirectly several aspects of IBE: selection, recombination, and nonrandom, non-Lamarckian mutation interact in a complementary fashion; evolution is driven by the interaction of parsimony and fit; and random bits do not directly encode improvement but enable generalization by the manner in which they connect with the rest of the evolutionary process. Connections are drawn to Darwin's observations that changed conditions increase the rate of production of heritable variation; to the causes of bell-shaped distributions of traits and how these distributions respond to selection; and to computational learning theory, where analogizing evolution to learning in accord with IBE casts individuals as examples and places the learned hypothesis at the population level. The model highlights the importance of incorporating internal integration of information through heritable change in both evolutionary theory and evolutionary computation.
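To make the estimation-of-distribution idea concrete: instead of mutating individuals directly, an EDA samples offspring from a learned distribution that is updated toward selected individuals. Below is a deliberately simpler univariate relative (PBIL-style, on the OneMax toy problem) of the multivariate EDA the paper builds on; all parameters are illustrative, and this is not the authors' model:

```python
import random

def pbil(n_bits=20, pop=60, elite=15, lr=0.2, gens=60, seed=1):
    """Univariate EDA sketch: evolve a bitwise probability vector toward elites."""
    rng = random.Random(seed)
    p = [0.5] * n_bits                                    # learned distribution over bits
    for _ in range(gens):
        sample = [[1 if rng.random() < pi else 0 for pi in p] for _ in range(pop)]
        best = sorted(sample, key=sum, reverse=True)[:elite]   # fitness = number of 1s
        mean = [sum(col) / elite for col in zip(*best)]
        p = [(1 - lr) * pi + lr * mi for pi, mi in zip(p, mean)]  # nudge model to elites
    return p

p = pbil()
print(sum(p) / len(p))   # should drift well above the 0.5 starting point
```

The key feature mirrored from the abstract: "mutations" (samples) are nonrandom because they are drawn from information accumulated over generations, yet non-Lamarckian because no individual's acquired state is written back directly.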

10
10-minimizers: a promising class of constant-space minimizers

Shur, A.; Tziony, I.; Orenstein, Y.

2026-03-18 bioinformatics 10.64898/2026.03.16.712052 medRxiv
Top 0.3%
1.3%

Minimizers are sampling schemes which are ubiquitous in almost any high-throughput sequencing analysis. Assuming a fixed alphabet of size σ, a minimizer is defined by two positive integers k, w and a linear order ρ on k-mers. A sequence is processed by a sliding-window algorithm that chooses in each window of length w + k − 1 its minimal k-mer with respect to ρ. A key characteristic of a minimizer is its density, which is the expected frequency of chosen k-mers among all k-mers in a random infinite σ-ary sequence. Minimizers of smaller density are preferred as they produce smaller samples, which lead to reduced runtime and memory usage in downstream applications. Recent studies developed methods to generate minimizers with optimal and near-optimal densities, but they require explicitly storing k-mer ranks in Ω(2^k) space. While constant-space minimizers exist, and some of them are proven to be asymptotically optimal, no constant-space minimizer has been proven to guarantee lower density than a random minimizer in the non-asymptotic regime, and many minimizer schemes suffer from long k-mer key-retrieval times due to complex computation. In this paper, we introduce 10-minimizers, which constitute a class of minimizers with promising properties. First, we prove that for every k > 1 and every w ≥ k − 2, a random 10-minimizer has, in expectation, lower density than a random minimizer. This is the first provable guarantee for a class of minimizers in the non-asymptotic regime. Second, we present spacers, which are particular 10-minimizers combining three desirable properties: they are constant-space, low-density, and have small k-mer key-retrieval time. In terms of density, spacers are competitive with the best known constant-space minimizers; in certain (k, w) regimes they achieve the lowest density among all known (not necessarily constant-space) minimizers.
Notably, we are the first to benchmark constant-space minimizers on the time spent for k-mer key retrieval, which is the most fundamental operation in many minimizer-based methods. Our empirical results show that spacers can retrieve k-mer keys in competitive time (a few seconds per genome-size sequence, which is less than required by random minimizers), for all practical values of k and w. We expect 10-minimizers to improve minimizer-based methods, especially those using large window sizes. We also propose the k-mer key-retrieval benchmark as a standard objective for any new minimizer scheme.
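The sliding-window selection rule defined above is easy to state in code. A generic sketch with a lexicographic order standing in for ρ (this illustrates minimizers in general, not the paper's 10-minimizers or spacers):

```python
# Generic minimizer scheme: each window of w consecutive k-mers (window length
# w + k - 1) keeps its minimal k-mer under `rank`; ties go to the leftmost.
def minimizer_positions(seq, k, w, rank):
    chosen = set()
    for start in range(len(seq) - (w + k - 1) + 1):
        window = range(start, start + w)                  # k-mer start positions
        chosen.add(min(window, key=lambda i: (rank(seq[i:i + k]), i)))
    return sorted(chosen)

# Lexicographic order as the rank; density ~ fraction of k-mers selected.
pos = minimizer_positions("ACGTACGTAC", k=3, w=3, rank=lambda s: s)
print(pos)  # -> [0, 1, 4, 5]
```

The density question the paper addresses is exactly how the choice of `rank` shrinks this selected set, and the key-retrieval question is how cheaply `rank` can be evaluated.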

11
Homology-based perspective on pangenome graphs

Lisiecka, A.; Kowalewska, A.; Dojer, N.

2026-03-18 bioinformatics 10.64898/2026.03.16.712038 medRxiv
Top 0.3%
1.2%

Pangenome graphs conveniently represent genetic variation within a population. Several types of such graphs have been proposed, with varying properties and potential applications. Among them, variation graphs (VGs) seem best suited to replace reference genomes in sequencing data processing, while whole genome alignments (WGAs) are particularly practical for comparative genomics applications. For both models, no widely accepted optimization criteria for a graph representing a given set of genomes have been proposed. In the current paper we introduce the concept of the homology relation induced by a pangenome graph on the characters of represented genomic sequences and define such relations for both the VG and WGA models. Then, we use this concept to propose homology-based metrics for comparing different graphs representing the same genome collection, and to formulate the desired properties of transformations between the VG and WGA models. Moreover, we propose several such transformations and examine their properties on pangenome graph data. Finally, we provide implementations of these transformations in the WGAtools package, available at https://github.com/anialisiecka/WGAtools.

12
Outperforming the Majority-Rule Consensus Tree Using Fine-Grained Dissimilarity Measures

Takazawa, Y.; Takeda, A.; Hayamizu, M.; Gascuel, O.

2026-03-18 bioinformatics 10.64898/2026.03.16.712085 medRxiv
Top 0.4%
1.0%

Phylogenetic analyses often require the summarization of multiple trees, e.g., in Bayesian analyses to obtain the centroid of the posterior distribution of trees, or to determine the consensus of a set of bootstrap trees. The majority-rule consensus tree is the most commonly used. It is easy to compute and minimizes the sum of Robinson-Foulds (RF) distances to the input trees. In mathematical terms, the majority-rule consensus tree is the median of the input trees with respect to the RF distance. However, due to the coarse nature of RF distance, which only considers whether two branches induce exactly the same bipartition of the taxa or not, highly unresolved trees can be produced when the phylogenetic signal is low. To overcome this limitation, we propose using median trees with respect to finer-grained dissimilarity measures between trees. These measures include a quartet distance between tree topologies, and transfer distances, which quantify the similarity between bipartitions, in contrast to the 0/1 view of RF. We describe fast heuristic consensus algorithms for transfer-based tree dissimilarities, capable of efficiently processing trees with thousands of taxa. Through evaluations on simulated datasets in both Bayesian and bootstrapping maximum-likelihood frameworks, our results show that our methods improve consensus tree resolution in scenarios with low to moderate phylogenetic signal, while providing better or comparable dissimilarities to the true phylogeny. Applying our methods to Mammal phylogeny and a large HIV dataset of over nine thousand taxa confirms the improvement with real data. These results demonstrate the usefulness of our new consensus tree methods for analyzing the large datasets that are available today. Our software, PhyloCRISP, is available from https://github.com/yukiregista/PhyloCRISP.
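The majority-rule baseline that the paper improves on is simple to state: keep every clade that occurs in more than half of the input trees. A minimal sketch with a hypothetical clade-set encoding of rooted trees (this is the baseline, not the authors' PhyloCRISP methods):

```python
from collections import Counter

def majority_rule(trees):
    """Majority-rule consensus: clades present in more than half of the trees."""
    counts = Counter(clade for tree in trees for clade in tree)
    n = len(trees)
    return {clade for clade, k in counts.items() if 2 * k > n}

t1 = {frozenset("AB"), frozenset("ABC")}
t2 = {frozenset("AB"), frozenset("ABD")}
t3 = {frozenset("AB"), frozenset("ABC")}
print(majority_rule([t1, t2, t3]) == {frozenset("AB"), frozenset("ABC")})  # -> True
```

The 0/1 nature of the clade match here is exactly the coarseness the abstract criticizes: a clade one taxon away from majority support is discarded entirely, which is what transfer- and quartet-based medians soften.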

13
How much information is there for inferring species trees?

Milkey, A.; Chen, J.; Lewis, P. O.

2026-04-02 evolutionary biology 10.64898/2026.04.01.715836 medRxiv
Top 0.6%
0.7%

As modern phylogenomics datasets become increasingly large, it is useful to develop recommendations for how to subsample datasets for best species tree inference. Here we apply a new measure of phylogenetic information content that estimates the reduction in tree space occupied by a posterior sample of inferred trees relative to a prior sample in order to assess the effects of gene tree parameters on species tree estimation. We find that, consistent with earlier studies, when data are informative, more data result in better species tree inference. However, when data are uninformative, subsampling a dataset to include only the most informative loci may produce a better species tree sample. We perform analyses on a variety of simulated and empirical datasets.

14
Correlation Between Information Entropy and Functions of Gene Sequences in the Evolutionary Context: A New Way to Construct Gene Regulatory Networks from Sequence

Pan, L.; Chen, M.; Tanik, M.

2026-04-07 bioinformatics 10.64898/2026.04.03.714856 medRxiv
Top 0.6%
0.7%

The information encoded in DNA sequences can be rigorously quantified using Shannon entropy and related measures. When placed in an evolutionary context, this quantification offers a principled yet underexplored route to constructing gene regulatory networks (GRNs) directly from sequence data. While most GRN inference methods rely exclusively on gene expression profiles, the regulatory code is ultimately written in the DNA sequence itself. Here we review the mathematical foundations of information theory as applied to gene sequences, survey existing computational methods for GRN inference--with emphasis on information-theoretic and sequence-based approaches--and examine how evolutionary conservation constrains sequence entropy to preserve biological function. We then propose a four-layer integrative framework that combines per-position Shannon entropy profiles, evolutionary conservation scoring via Jensen-Shannon divergence, expression-based mutual information and transfer entropy, and DNA foundation model embeddings to construct GRNs from sequence. Through worked examples on the Escherichia coli SOS regulatory sub-network, we demonstrate how conservation-weighted mutual information improves edge discrimination and how transfer entropy resolves regulatory directionality. The framework generates testable predictions: edges supported by low-entropy regulatory regions should show higher experimental validation rates, and cross-species entropy profile conservation should predict GRN topology conservation. This work bridges three scales of biological information--nucleotide-level entropy, evolutionary constraint patterns, and network-level regulatory logic--establishing information entropy as the natural mathematical language for sequence-to-network regulatory inference.
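The first layer of the proposed framework, a per-position Shannon entropy profile, can be sketched directly from its definition: for each alignment column, the entropy of the symbol distribution. A toy illustration (the alignment is made up; this is not the authors' pipeline):

```python
import math
from collections import Counter

def position_entropies(alignment):
    """Shannon entropy (bits) of each column of an aligned set of sequences."""
    out = []
    for j in range(len(alignment[0])):
        counts = Counter(seq[j] for seq in alignment)
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        out.append(h if h > 0 else 0.0)   # avoid -0.0 on fully conserved columns
    return out

aln = ["ACGT", "ACGA", "ACTA"]            # toy orthologous sequence fragments
print([round(h, 3) for h in position_entropies(aln)])
```

Fully conserved columns score 0 bits; in the framework's prediction, regulatory edges supported by such low-entropy regions should validate at higher rates.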

15
Horse, not zebra: accounting for lineage abundance in maximum likelihood phylogenetics

De Maio, N.

2026-03-27 bioinformatics 10.64898/2026.03.25.714173 medRxiv
Top 0.7%
0.7%

Maximum likelihood phylogenetic methods are popular approaches for estimating evolutionary histories. These methods do not assume prior hypotheses regarding the shape of the phylogenetic tree, and this lack of prior assumptions can be useful in particular in the case of idiosyncratic sampling patterns. For example, the rate at which species are sequenced can differ widely between lineages, with lineages of more interest to humans usually being sequenced more often than others. However, in some settings sampling can be lineage-agnostic. In genomic epidemiology, for example, the sequencing rate can change through time or across locations, but is often agnostic to the specific pathogen strain being sequenced. In this scenario, one expects that the abundance of a pathogen strain at a specific time and location in the host population is reflected in the relative abundance of that strain among the genomes sequenced at that time and location. Here, I show that this simple assumption, when appropriate and incorporated within maximum likelihood phylogenetics, can greatly improve the accuracy of phylogenetic inference. This is similar to the famous medical principle "when you hear hoofbeats, think of horses, not zebras". In our application this means that when, for example, a (possibly incomplete) genome sequence has a similar likelihood of belonging to multiple different strains, I aim to prioritize phylogenetic placement onto a common strain (the "horse", a common disease) rather than a rare one (the "zebra", a rare disease). I introduce and assess two separate approaches to achieve this. The first approach rescales the likelihood of a phylogenetic tree by the number of distinct binary topologies obtainable by arbitrarily resolving multifurcations in the tree.
This approach is based on a new interpretation of multifurcating phylogenetic trees particularly relevant at low divergence: multifurcations represent a lack of signal for resolving the bifurcating topology rather than an instantaneous multifurcating event, and so a multifurcating tree is interpreted as the set of bifurcating trees consistent with the multifurcating one, rather than as a single multifurcating topology. The second approach instead includes a tree prior that assumes that genomes are sequenced at a rate proportional to their abundance. Both approaches favor phylogenetic placement at abundant lineages, and using simulations I show that both methods dramatically improve the accuracy of phylogenetic inference in scenarios like SARS-CoV-2 phylogenetics, where large multifurcations are common. This considerable impact is also observed in real pandemic-scale SARS-CoV-2 genome data, where accounting for lineage prevalence reduces phylogenetic uncertainty by around one order of magnitude. Both approaches were implemented as part of the free and open source phylogenetic software MAPLE v0.7.5.4 (https://github.com/NicolaDM/MAPLE).
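The rescaling factor in the first approach is a purely combinatorial quantity: a multifurcating node with d children admits (2d − 3)!! rooted binary resolutions (the standard count of rooted binary trees on d leaves, assumed here), and a tree's total is the product over its multifurcating nodes. A small sketch of that count (not MAPLE's code):

```python
# Number of binary resolutions of one multifurcation with d children:
# (2d - 3)!! = 1 * 3 * 5 * ... * (2d - 3); bifurcations (d = 2) contribute 1.
def resolutions(d):
    out = 1
    for m in range(3, 2 * d - 2, 2):
        out *= m
    return out

for d in (2, 3, 4, 5):
    print(d, resolutions(d))   # 2->1, 3->3, 4->15, 5->105
```

The rapid growth of this count is what makes placement at large multifurcations (abundant, poorly resolved lineages) strongly favored once the likelihood is rescaled by it.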

16
New Space-Time Tradeoffs for Subset Rank and k-mer Lookup

Diseth, A. C.; Puglisi, S. J.

2026-03-18 bioinformatics 10.64898/2026.03.16.712042 medRxiv
Top 0.7%
0.7%

Given a sequence S of subsets of symbols drawn from an alphabet of size{sigma} , a subset rank query srank(i, c) asks for the number of subsets before the ith subset that contain the symbol c. It was recently shown (Alanko et al., Proc. SIAM ACDA, 2023) that subset rank queries on the spectral Burrows-Wheeler lead to efficient k-mer lookup queries, an essential and widespread task in genomic sequence analysis. In this paper we design faster subset rank data structures that use small space--less than 3 bits per k-mer. Our experiments show that this translates to new Pareto optimal SBWT-based k-mer lookup structures at the low-memory end of the space-time spectrum.
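The query defined above is straightforward to state naively, which helps clarify what the compressed data structures must answer. A toy sketch (O(i) per query; the paper's SBWT-based structures answer the same query in under 3 bits per k-mer):

```python
# Naive subset rank: srank(i, c) = how many of the first i subsets contain c.
def srank(subsets, i, c):
    """Count subsets among subsets[0..i-1] that contain symbol c."""
    return sum(1 for s in subsets[:i] if c in s)

# Toy sequence of subsets over the alphabet {A, C}.
S = [{"A"}, {"A", "C"}, set(), {"C"}, {"A"}]
print(srank(S, 4, "C"))  # -> 2
print(srank(S, 5, "A"))  # -> 3
```

In the SBWT setting, one such rank query per character of the k-mer walks the query through the graph, which is why srank speed dominates k-mer lookup time.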

17
Castration-resistant prostate cancer cells are addicted to the high activity of cyclin-dependent kinase 2

Chatterjee, J.; Marin, A.; Yalala, S.; Itkonen, H. M.

2026-03-18 cancer biology 10.64898/2026.03.17.712428 medRxiv
Top 0.9%
0.5%

Background: Cyclin-dependent kinases drive progression through the cell cycle and thereby form classical targets for cancer therapy. In prostate cancer (PC), the first line of therapy typically targets the androgen receptor (AR), but it frequently leads to development of an incurable form of the disease, castration-resistant PC (CRPC). Here, we sought to understand if CRPC cells are selectively addicted to a specific cell cycle kinase. Methods: We used PC and CRPC patient data to evaluate transcriptional changes and modeled the responses in vitro using multiple models of PC, CRPC and normal cells. Development of a CDK2 inhibitor-resistant CRPC cell line and a compound screen were used to identify chronic and acute vulnerabilities to augment the efficacy of our candidate therapy in multiple PC, CRPC and also normal cells, to ensure selectivity. Results: We show that the emergence of CRPC is associated with significant upregulation of cyclins that positively regulate cyclin-dependent kinase 2 (CDK2) and downregulation of CDK4 cyclins. Accordingly, CDK2-specific inhibitors and CDK2 knockdown efficiently reduce proliferation of PC and CRPC cells. The CDK2 inhibitor-resistant CRPC model displayed transcriptional rewiring of cell cycle regulators, characterized by a shift towards CDK4/6-dependency and increased AR-signaling. A combinatorial drug screen discovered both antagonistic and additive combinations, and we show that AR inhibitors selectively augment the efficacy of CDK2 inhibitors against PC and CRPC cells, but the combination is not toxic to normal cells. Conclusion: We discovered that CRPC cells are addicted to high CDK2 activity and show that combining CDK2 inhibitors with currently used anti-CRPC therapies selectively augments their efficacy.

18
Learning gene interactions from tabular gene expression data using Graph Neural Networks

Boulougouri, M.; Nallapareddy, M. V.; Vandergheynst, P.

2026-03-23 bioinformatics 10.64898/2026.03.19.712949 medRxiv
Top 0.9%
0.5%

Gene interactions form complex networks underlying disease susceptibility and therapeutic response. While bulk transcriptomic datasets offer rich resources for studying these interactions, applying Graph Neural Networks (GNNs) to such data remains limited by a lack of methodological guidance, especially for constructing gene interaction graphs. We present REGEN (REconstruction of GEne Networks), a GNN-based framework that simultaneously learns latent gene interaction networks from bulk transcriptomic profiles and predicts patient vital status. Evaluated across seven cancer types in the TCGA cohort, REGEN outperforms baseline models in five datasets and provides robust network inference. By systematically comparing strategies for initializing gene-gene adjacency matrices, we derive practical guidelines for GNN application to bulk transcriptomics. Analysis of the learned kidney cancer gene network reveals cancer-related pathways and biomarkers, validating the model's biological relevance. Together, we establish a principled approach for applying GNNs to bulk transcriptomics, enabling improved phenotype prediction and meaningful gene network discovery.

19
scRGCL: Neighbor-Aware Graph Contrastive Learning for Robust Single-Cell Clustering

Fan, J.; Liu, F.; Lai, X.

2026-03-18 bioinformatics 10.64898/2026.03.16.712039 medRxiv
Top 0.9%
0.5%

Accurate cell type identification is a fundamental step in single-cell RNA sequencing (scRNA-seq) data analysis, providing critical insights into cellular heterogeneity at high resolution. However, the high dimensionality, zero-inflated, and long-tailed distribution of scRNA-seq data pose significant computational challenges for conventional clustering approaches. Although recent deep learning-based methods utilize contrastive learning to joint-learn representations and clustering assignments, they often overlook cluster-level information, leading to suboptimal feature extraction for downstream tasks. To address these limitations, we propose scRGCL, a single-cell clustering method that learns a regularized representation guided by contrastive learning. Specifically, scRGCL captures the cell-type-associated expression structure by clustering similar cells together while ensuring consistency. For each sample, the model performs negative sampling by selecting cells from distinct clusters, thereby ensuring semantic dissimilarity between the target cell and its negative pairs. Moreover, scRGCL introduces a neighbor-aware re-weighting strategy that increases the contribution of samples from clusters closely related to the target. This mechanism prevents cells from the same category from being mistakenly pushed apart, effectively preserving intra-cluster compactness. Extensive experiments on fourteen public datasets demonstrate that scRGCL consistently outperforms state-of-the-art methods, as evidenced by significant improvements in normalized mutual information (NMI) and adjusted rand index (ARI). Moreover, ablation studies confirm that the integration of cluster-aware negative sampling and the neighbor-aware re-weighting module is essential for achieving high-fidelity clustering. 
By harmonizing cell-level contrast with cluster-level guidance, scRGCL provides a robust and scalable framework that advances the precision of automated cell-type discovery in increasingly complex single-cell landscapes.

Key Messages:
- scRGCL uses contrastive learning on a regularized representation for single-cell clustering.
- scRGCL outperforms four state-of-the-art methods on 15 datasets.
- scRGCL's cluster-aware negative sampling and neighbor-aware re-weighting modules are essential for high-fidelity single-cell clustering.
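The abstract describes two mechanisms: negatives drawn only from other clusters, and a neighbor-aware re-weighting that boosts negatives from clusters close to the target's cluster. A minimal NumPy sketch of an InfoNCE-style loss combining the two is given below; the function name, the centroid-similarity weighting, and all parameters are illustrative assumptions, not scRGCL's actual implementation.

```python
import numpy as np

def neighbor_aware_contrastive_loss(z, labels, centroids, tau=0.5):
    """Illustrative contrastive loss with cluster-aware negatives.

    z:         (n, d) L2-normalized cell embeddings
    labels:    (n,) cluster assignment per cell
    centroids: (k, d) L2-normalized cluster centroids
    Negatives for cell i are cells from other clusters, re-weighted by
    how similar their cluster's centroid is to cell i's centroid, so
    nearby clusters contribute more (hypothetical reading of the
    abstract's re-weighting strategy).
    """
    n = len(z)
    sim = z @ z.T / tau              # temperature-scaled pairwise similarities
    csim = centroids @ centroids.T   # cluster-to-cluster centroid similarity
    losses = []
    for i in range(n):
        pos = labels == labels[i]    # same-cluster cells are positives
        pos[i] = False               # exclude the anchor itself
        neg = labels != labels[i]    # other-cluster cells are negatives
        if not pos.any() or not neg.any():
            continue
        # weight each negative by its cluster's proximity to the anchor's cluster
        w = np.exp(csim[labels[i], labels[neg]])
        w = w / w.sum()
        neg_term = (w * np.exp(sim[i, neg])).sum()
        pos_term = np.exp(sim[i, pos]).mean()
        losses.append(-np.log(pos_term / (pos_term + neg_term)))
    return float(np.mean(losses))
```

Because negatives never come from the anchor's own cluster, same-category cells are never pushed apart, which is the intra-cluster-compactness property the abstract highlights.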

20
Novabrowse: A Tool for High-Resolution Synteny Analysis, Ortholog Detection, and Gene Signal Discovery

Rikk, L.; Ghaffarinia, A.; Leigh, N. D.

2026-03-30 genomics 10.64898/2026.03.27.714371 medRxiv
Top 0.9%
0.5%

Accurate genome annotation remains challenging, as assembly quality often exceeds annotation reliability. Resolving ambiguities of gene presence, absence, and orthology typically requires integrating two complementary lines of evidence: sequence homology between species and the conservation of gene order (i.e., synteny). BLAST remains the standard for homology detection, yet its raw output can be difficult to interpret. Existing tools address this challenge but operate at opposing scales: alignment viewers provide detailed pairwise statistics without genomic context, while synteny tools offer chromosome-scale perspectives without sequence-level resolution. To fill this intermediate gap, we developed Novabrowse, an interactive framework for interpreting BLAST results that features high-resolution multi-species synteny analysis, chromosomal rearrangement investigation, ortholog detection, and gene signal discovery. Users define a genomic region of interest in a query species and/or use custom sequences, then select one or more subject species for comparison. The pipeline retrieves query gene sequences via NCBI API integration and performs BLAST searches against each subject transcriptome or genome. Results are presented in an interactive HTML file featuring alignment statistics, chromosomal maps, coverage visualizations, ribbon plots, and distance-based clustering of high-scoring segment pairs (HSPs) into putative gene units. We demonstrate these capabilities by investigating Foxp3, Aire, and Rbl1, three highly conserved vertebrate genes, in the recently assembled genome of the newt Pleurodeles waltl. Foxp3 and Aire have not been described in any salamander species to date, despite the availability of multiple assemblies and extensive transcriptomic datasets. Using Novabrowse, we discovered conserved loci and gene signals for both genes in P. waltl, the presence of which was subsequently confirmed via Nanopore long-read RNA sequencing.
In contrast, Rbl1 analysis uncovered a chromosomal rearrangement at its expected locus with no gene signal detected, indicating a gene loss specific to P. waltl despite the gene's retention in the closely related axolotl (Ambystoma mexicanum). Our findings demonstrate Novabrowse's capacity for evidence-based evaluation of annotation artifacts, an essential capability as high-quality assemblies become more available for phylogenetically diverse species. Novabrowse is open source (MIT license) and freely available at: https://github.com/RegenImm-Lab/Novabrowse.
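The abstract mentions distance-based clustering of high-scoring segment pairs into putative gene units. A minimal sketch of one way such interval grouping could work is shown below; the function name, tuple layout, and gap threshold are illustrative assumptions, not Novabrowse's actual algorithm.

```python
def cluster_hsps(hsps, max_gap=5000):
    """Group BLAST HSPs into putative gene units by genomic distance.

    hsps: list of (subject_id, start, end) tuples with start <= end.
    HSPs on the same subject sequence are merged into one unit when the
    gap between them is at most max_gap bp -- a simple single-linkage
    interval clustering over sorted coordinates (hypothetical scheme).
    """
    units = []
    for sid, start, end in sorted(hsps):
        # extend the previous unit if this HSP is on the same subject
        # sequence and close enough to it; otherwise start a new unit
        if units and units[-1][0] == sid and start - units[-1][2] <= max_gap:
            units[-1][2] = max(units[-1][2], end)
        else:
            units.append([sid, start, end])
    return [tuple(u) for u in units]
```

For example, two HSPs 700 bp apart on the same chromosome would collapse into one putative gene unit, while an HSP 18 kb away would start a new unit, which is the kind of grouping a synteny browser needs before drawing gene-level ribbons.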